The hands-on exercise for this week focuses on: 1) estimating a topic model; 2) interpreting and visualizing the results. Remember that you will need to: 1) comment your code and 2) write out the interpretation of your results.
You will learn how to estimate and interpret topic models using the quanteda and topicmodels packages.
Before proceeding, we’ll load the packages we will need for this tutorial.
library(tidyverse) # loads dplyr, ggplot2, and others
library(stringr) # to handle text elements
library(tidytext) # includes set of functions useful for manipulating text
library(topicmodels) # to estimate topic models
library(gutenbergr) # to get text data
library(scales)
library(tm)
library(ggthemes) # to make your plots look nice
library(readr)
library(quanteda)
library(quanteda.textmodels)
You may need to install the preText package if you haven’t done so yet. For that you will need to run the next code chunk (it is currently set to ‘eval=F’, which tells R ‘do not execute this code chunk’). That package is not readily available through the usual install.packages() route; it needs to be downloaded from the GitHub repository set up by its creator, Matthew J. Denny. We can do that using the command install_github(). This command is part of the devtools package, which you will need to install as well (if you haven’t done so already). The devtools package is available on CRAN, so it can be installed with the usual command install.packages().
#install.packages("devtools")
devtools::install_github("matthewjdenny/preText")
library(preText)
We’ll be using data from Alexis de Tocqueville’s “Democracy in America.”
We have already downloaded some data for you, but we also included the code to download it yourself (it is currently set to ‘eval=F’, so it won’t run unless you remove the eval=F argument or run the chunk directly).
The code downloads these data, both Volume 1 and Volume 2, and combines them into one data frame. For this, we’ll be using the gutenbergr package, which allows the user to download text data from over 60,000 out-of-copyright books. The ID for each book appears in the URL for the book selected after a search on https://www.gutenberg.org/ebooks/.
This example is adapted from Text Mining with R: A Tidy Approach by Julia Silge and David Robinson.
Here, we see that Volume 1 of Tocqueville’s “Democracy in America” is stored as “815”. A separate search reveals that Volume 2 is stored as “816”.
#USING THIS DATA
tocq <- gutenberg_download(c(815, 816),
meta_fields = "author")
## Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
## Using mirror http://aleph.gutenberg.org
Or we can read the dataset we already downloaded for you in the following way:
tocq <- readRDS(gzcon(url("https://github.com/cjbarrie/CTA-ED/blob/main/data/topicmodels/tocq.RDS?raw=true")))
Once we have read in these data, we convert them into a different data shape: the document-term matrix (DTM). We also create a new column, which we call “booknumber”, that records whether the term in question is from Volume 1 or Volume 2. To convert from tidy format into a “DocumentTermMatrix” object, we can first use unnest_tokens() as we have done in past exercises, remove stop words, and then use the cast_dtm() function.
#convert to a different data shape: the DTM
tocq_words <- tocq %>%
  mutate(booknumber = ifelse(gutenberg_id==815, "DiA1", "DiA2")) %>% #new column (booknumber): is the term from Vol. 1 or Vol. 2?
  unnest_tokens(word, text) %>% #lowercase words and remove punctuation
  filter(!is.na(word)) %>% #drop empty tokens
  count(booknumber, word, sort = TRUE) %>% #count each word per volume
  ungroup() %>%
  anti_join(stop_words) #remove stop words
## Joining with `by = join_by(word)`
tocq_dtm <- tocq_words %>%
cast_dtm(booknumber, word, n) #convert to dtm
tm::inspect(tocq_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 11989)>>
## Non-/sparse entries: 17420/6558
## Sparsity : 27%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs country democratic government laws nations people power society time
## DiA1 353 213 531 395 231 499 540 291 309
## DiA2 167 561 162 133 313 360 263 241 309
## Terms
## Docs united
## DiA1 556
## DiA2 227
We see here that the data are now stored as a “DocumentTermMatrix.” In this format, the matrix records each term (as a column), each document (as a row), and the number of times the term appears in the given document. Many terms will not appear in a given document, so the matrix is stored as “sparse,” meaning it contains a preponderance of zeroes. Here, since we are looking at only two documents, each an entire volume, the sparsity is relatively low (only 27%). In most applications, the sparsity will be a lot higher, approaching 99% or more.
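To make the sparsity figure concrete, here is a minimal sketch on a toy two-document corpus (not the Tocqueville data), computing sparsity as the share of zero cells in the matrix:

```r
library(tm)

# Toy corpus of two tiny "documents"
toy <- Corpus(VectorSource(c("power of the people",
                             "democratic society and the people")))
toy_dtm <- DocumentTermMatrix(toy)

# Sparsity = proportion of cells in the matrix that are zero
m <- as.matrix(toy_dtm)
sum(m == 0) / length(m)
```

Each term that appears in one toy document but not the other contributes a zero cell; with more documents, most terms are missing from most documents, which is why sparsity climbs so quickly.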
Estimating our topic model is then relatively simple. All we need to do is specify how many topics we want to search for, and we can also set our seed, which is needed to reproduce the same results each time (as the model is a generative probabilistic one, meaning different random iterations will produce different results).
#set seed and specify how many topics to search for (10)
tocq_lda <- LDA(tocq_dtm, k = 10, control = list(seed = 1234))
After this we can extract the per-topic-per-word probabilities, called “β” from the model:
#extract the beta
tocq_topics <- tidy(tocq_lda, matrix = "beta")
head(tocq_topics, n = 10)#show top 10
## # A tibble: 10 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 democratic 0.00735
## 2 2 democratic 0.00494
## 3 3 democratic 0.0169
## 4 4 democratic 0.00200
## 5 5 democratic 0.00434
## 6 6 democratic 0.00509
## 7 7 democratic 0.00140
## 8 8 democratic 0.00840
## 9 9 democratic 0.0138
## 10 10 democratic 0.00968
We now have the data stored as one topic-per-term-per-row. The betas listed here represent the probability that the given term belongs to a given topic. So, here, we see that the term “democratic” is most likely to belong to topic 3 (β = 0.0169). Strictly, this probability represents the probability that the term is generated from the topic in question.
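As a quick check on this reading, we can pull out the highest-β topic for a given term directly from the tidied output (a sketch using the tocq_topics tibble created above):

```r
# Which topic assigns "democratic" its highest beta?
tocq_topics %>%
  filter(term == "democratic") %>%
  slice_max(beta, n = 1)
```

The same pattern (filter on a term, then slice_max on beta) works for any word you want to inspect.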
We can then plot the top terms by beta for each topic as follows:
#plot beta for each topic (top terms)
tocq_top_terms <- tocq_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
tocq_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 4) +
scale_y_reordered() +
theme_tufte(base_family = "Helvetica")
But how do we actually evaluate these topics? Here, the topics all seem pretty similar.
Well, one way to evaluate the performance of unsupervised forms of classification is by testing our model on an outcome that is already known.
Here, the two most obvious topics are Volume 1 and Volume 2 of Tocqueville’s “Democracy in America.” Volume 1 of Tocqueville’s work deals more obviously with abstract constitutional ideas and questions of race; Volume 2 focuses on more esoteric aspects of American society. Listen to an “In Our Time” episode with Melvyn Bragg discussing Democracy in America here.
Given these differences in focus, we might think that a generative model could assign chapters to the correct topic (i.e., volume) with some accuracy.
First let’s have a look and see whether there really are words obviously distinguishing the two Volumes.
tidy_tocq <- tocq %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
## Count most common words in both
tidy_tocq %>%
count(word, sort = TRUE)
## # A tibble: 11,989 × 2
## word n
## <chr> <int>
## 1 people 859
## 2 power 803
## 3 united 783
## 4 democratic 774
## 5 government 693
## 6 time 618
## 7 nations 544
## 8 society 532
## 9 laws 528
## 10 country 520
## # ℹ 11,979 more rows
bookfreq <- tidy_tocq %>%
mutate(booknumber = ifelse(gutenberg_id==815, "DiA1", "DiA2")) %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
count(booknumber, word) %>%
group_by(booknumber) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
spread(booknumber, proportion)
ggplot(bookfreq, aes(x = DiA1, y = DiA2, color = abs(DiA1 - DiA2))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
theme_tufte(base_family = "Helvetica") +
theme(legend.position="none",
strip.background = element_blank(),
strip.text.x = element_blank()) +
labs(x = "Tocqueville DiA 1", y = "Tocqueville DiA 2") +
coord_equal()
## Warning: Removed 6100 rows containing missing values (`geom_point()`).
## Warning: Removed 6101 rows containing missing values (`geom_text()`).
We see that there do seem to be some marked distinguishing characteristics. In the plot above, for example, we see that more abstract notions of state systems appear with greater frequency in Volume 1 while Volume 2 seems to contain words specific to America (e.g., “north” and “south”) with greater frequency. The way to read the above plot is that words positioned further away from the diagonal line appear with greater frequency in one volume versus the other.
Below, we first separate the volumes into chapters, then repeat the same procedure as above. The only difference now is that instead of two documents representing the two full volumes of Tocqueville’s work, we have 79 documents, each representing an individual chapter. Notice that the sparsity is now much higher: around 95%.
tocq <- tocq %>%
filter(!is.na(text))
# Divide into documents, each representing one chapter
tocq_chapter <- tocq %>%
mutate(booknumber = ifelse(gutenberg_id==815, "DiA1", "DiA2")) %>%
group_by(booknumber) %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, booknumber, chapter)
# Split into words
tocq_chapter_word <- tocq_chapter %>%
unnest_tokens(word, text)
# Find document-word counts
tocq_word_counts <- tocq_chapter_word %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
tocq_word_counts
## # A tibble: 44,633 × 3
## document word n
## <chr> <chr> <int>
## 1 DiA1_1 people 204
## 2 DiA1_1 government 198
## 3 DiA1_3 union 198
## 4 DiA1_1 power 184
## 5 DiA1_1 union 157
## 6 DiA1_1 public 155
## 7 DiA1_1 united 154
## 8 DiA1_1 federal 139
## 9 DiA1_3 united 139
## 10 DiA1_3 south 134
## # ℹ 44,623 more rows
# Cast into DTM format for LDA analysis
tocq_chapters_dtm <- tocq_word_counts %>%
cast_dtm(document, word, n)
tm::inspect(tocq_chapters_dtm)
## <<DocumentTermMatrix (documents: 79, terms: 11014)>>
## Non-/sparse entries: 44633/825473
## Sparsity : 95%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs americans country democratic government nations people power society
## DiA1_1 58 99 85 198 91 204 184 86
## DiA1_2 86 88 92 63 46 106 73 85
## DiA1_3 106 76 10 110 41 60 62 38
## DiA2_46 3 4 4 1 4 3 2 6
## DiA2_5 1 2 11 0 5 7 2 4
## DiA2_60 9 10 10 0 12 16 11 15
## DiA2_63 1 4 31 1 13 20 6 16
## DiA2_73 2 4 9 33 19 12 34 3
## DiA2_75 1 1 22 20 12 6 14 11
## DiA2_76 5 11 10 24 12 31 27 16
## Terms
## Docs time united
## DiA1_1 83 154
## DiA1_2 52 123
## DiA1_3 90 139
## DiA2_46 9 2
## DiA2_5 6 4
## DiA2_60 8 6
## DiA2_63 11 7
## DiA2_73 16 0
## DiA2_75 12 0
## DiA2_76 50 88
We then re-estimate the topic model with this new DocumentTermMatrix object, specifying k equal to 2. This will enable us to evaluate whether the topic model can assign chapters to the correct volume.
#evaluate whether a topic model can generatively assign to volume with accuracy
tocq_chapters_lda <- LDA(tocq_chapters_dtm, k = 2, control = list(seed = 1234))
After this, it is worth looking at another output of the latent Dirichlet allocation procedure. The γ probability represents the per-document-per-topic probability or, in other words, the probability that a given document (here: a chapter) belongs to a particular topic (and here, we are assuming these topics represent volumes).
The gamma values are therefore the estimated proportion of words within a given chapter allocated to a given volume.
#gamma is the proportion of words within a given chapter allocated to a given volume
tocq_chapters_gamma <- tidy(tocq_chapters_lda, matrix = "gamma")
tocq_chapters_gamma
## # A tibble: 158 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 DiA1_1 1 1.00
## 2 DiA1_3 1 0.667
## 3 DiA1_2 1 0.999
## 4 DiA2_76 1 1.00
## 5 DiA2_60 1 0.385
## 6 DiA2_16 1 0.0000771
## 7 DiA2_22 1 0.000139
## 8 DiA2_64 1 0.0000727
## 9 DiA2_73 1 0.0853
## 10 DiA2_28 1 0.562
## # ℹ 148 more rows
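One quick sanity check on these values: each chapter’s words are fully allocated across the k topics, so the gamma values for any one document should sum to (approximately) one. A minimal sketch using the tocq_chapters_gamma tibble above:

```r
# Per-document gamma values should sum to roughly 1
tocq_chapters_gamma %>%
  group_by(document) %>%
  summarise(total_gamma = sum(gamma)) %>%
  head()
```

If a document’s gammas did not sum to one, something would be wrong with how the tidied output was filtered or joined.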
Now that we have these topic probabilities, we can see how well our unsupervised learning did at distinguishing the two volumes generatively just from the words contained in each chapter.
# First separate the document name into title and chapter
tocq_chapters_gamma <- tocq_chapters_gamma %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
#classify each chapter by its highest gamma
tocq_chapter_classifications <- tocq_chapters_gamma %>%
group_by(title, chapter) %>%
top_n(1, gamma) %>%
ungroup()
#make book topic by title (volume of book)
tocq_book_topics <- tocq_chapter_classifications %>%
count(title, topic) %>%
group_by(title) %>%
top_n(1, n) %>%
ungroup() %>%
transmute(consensus = title, topic)
#join with tocq_book_topics to flag chapters assigned to the wrong volume
tocq_chapter_classifications %>%
inner_join(tocq_book_topics, by = "topic") %>%
filter(title != consensus)
## # A tibble: 8 × 5
## title chapter topic gamma consensus
## <chr> <int> <int> <dbl> <chr>
## 1 DiA2 76 1 1.00 DiA1
## 2 DiA2 28 1 0.562 DiA1
## 3 DiA2 27 1 0.703 DiA1
## 4 DiA2 54 1 0.740 DiA1
## 5 DiA2 53 1 0.516 DiA1
## 6 DiA2 51 1 0.616 DiA1
## 7 DiA2 52 1 0.659 DiA1
## 8 DiA2 44 1 0.520 DiA1
# Look at document-word pairs to see which words in each document were assigned to a given topic
assignments <- augment(tocq_chapters_lda, data = tocq_chapters_dtm)
assignments
## # A tibble: 44,633 × 4
## document term count .topic
## <chr> <chr> <dbl> <dbl>
## 1 DiA1_1 people 204 1
## 2 DiA1_3 people 60 1
## 3 DiA1_2 people 106 1
## 4 DiA2_76 people 31 1
## 5 DiA2_60 people 16 2
## 6 DiA2_16 people 7 2
## 7 DiA2_22 people 2 2
## 8 DiA2_64 people 8 2
## 9 DiA2_73 people 12 2
## 10 DiA2_28 people 4 1
## # ℹ 44,623 more rows
#separate assignments by title and chapter
assignments <- assignments %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
inner_join(tocq_book_topics, by = c(".topic" = "topic"))
assignments %>%
count(title, consensus, wt = count) %>%
group_by(title) %>%
mutate(percent = n / sum(n)) %>%
ggplot(aes(consensus, title, fill = percent)) +
geom_tile() +
scale_fill_gradient2(high = "red", label = percent_format()) + #sorts the variables by high (red)
geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +
theme_tufte(base_family = "Helvetica") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid = element_blank()) +
labs(x = "Book words assigned to",
y = "Book words came from",
fill = "% of assignments")
Not bad! Reading off the plot above, the model assigned the large majority of words to the correct volume: roughly 94% for Volume 1 and 77% for Volume 2.
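One way to put a number on chapter-level accuracy (as distinct from the word-level percentages in the plot) is to recompute it from the chapter classifications; a sketch using the tocq_chapter_classifications and tocq_book_topics objects from above:

```r
# Share of chapters whose modal topic matches the consensus
# topic for their volume
tocq_chapter_classifications %>%
  inner_join(tocq_book_topics, by = "topic") %>%
  group_by(title) %>%
  summarise(chapter_accuracy = mean(title == consensus))
```

This is the same join used above to list misclassified chapters, just summarised as a proportion per volume instead of filtered.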
In the articles by @ying_topics_2021 and @denny_text_2018 from this and previous weeks, we read about potential validation techniques.
In this section, we’ll be using the preText package mentioned in @denny_text_2018 to see the impact of different pre-processing choices on our text. Here, I am adapting from a tutorial by Matthew Denny.
First we need to reformat our text into a quanteda corpus object.
# load in corpus of Tocqueville text data.
corp <- corpus(tocq, text_field = "text")
# sample 1,000 documents for this example
documents <- corp[sample(1:30000,1000)]
# take a look at the document names
print(names(documents[1:10])) #what are these documents?
## [1] "text27303" "text23936" "text1470" "text16582" "text24195" "text28032"
## [7] "text458" "text6386" "text21648" "text17997"
And now we are ready to preprocess in different ways. Here, we are including n-grams so we are preprocessing the text in 128 different ways. This takes about ten minutes to run on a machine with 8GB RAM.
#factorial preprocessing including n-grams
preprocessed_documents <- factorial_preprocessing(
documents,
use_ngrams = TRUE,
infrequent_term_threshold = 0.2,
verbose = FALSE)
We can then get the results of our pre-processing, comparing the distance between documents that have been processed in different ways.
#results of preprocessing (this can take several minutes)
preText_results <- preText(
preprocessed_documents,
dataset_name = "Tocqueville text",
distance_method = "cosine",
num_comparisons = 20,
verbose = FALSE)
And we can plot these accordingly.
preText_score_plot(preText_results)
#load in the data
keyhar<- gutenberg_download(c(57592, 35534),
meta_fields = "author")
#convert data to DTM, with a new column (author) for which author: Key or Hardy
keyhar_words <- keyhar %>%
mutate(author = ifelse(gutenberg_id==57592, "Key", "Hardy")) %>%
unnest_tokens(word, text) %>%
filter(!is.na(word)) %>%
count(author, word, sort = TRUE) %>%
ungroup() %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
keyhar_dtm <- keyhar_words %>%
cast_dtm(author, word, n)
tm::inspect(keyhar_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 13031)>>
## Non-/sparse entries: 16604/9458
## Sparsity : 36%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs children husband life love marriage married time wife woman women
## Hardy 188 189 270 179 171 171 90 335 107 81
## Key 325 70 523 731 300 73 186 84 369 288
2a) In the DTM we can see the difference in frequency for each word used by Hardy and Key, with distinct differences in terms like “love”, which Key uses roughly four times as often as Hardy (731 vs. 179 raw counts). The sparsity reflects the fact that we are looking at just two documents: 36% is quite low compared to DTMs built from many documents, which suggests our DTM is relatively dense.
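The ratio quoted above can be read straight out of the DTM; a quick sketch using the keyhar_dtm object and the raw counts shown by inspect():

```r
# Raw counts of "love" for each author
m <- as.matrix(keyhar_dtm)
m["Key", "love"] / m["Hardy", "love"]  # 731 / 179, roughly 4
```

For a fairer comparison one could divide each count by the author’s total word count first, since the two books differ in length.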
#estimating topic model
keyhar_lda <- LDA(keyhar_dtm, k = 6, control = list(seed = 1234))
#extract the beta (per topic per word prob)
keyhar_topics <- tidy(keyhar_lda, matrix = "beta")
head(keyhar_topics, n = 6)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 love 0.00689
## 2 2 love 0.00143
## 3 3 love 0.0238
## 4 4 love 0.0185
## 5 5 love 0.0126
## 6 6 love 0.0103
2b) The betas listed here represent the probability that the given term belongs to a given topic. So, here, we see that the term “love” is most likely to belong to topic 3 (b=.0238).
#plot the prob
keyhar_top_terms <- keyhar_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
keyhar_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) + #shows 3 panels per row
scale_y_reordered() +
theme_tufte(base_family = "Helvetica")
2c) We divided the topics into six groups (k = 6) because our data are smaller than the DiA data, for which we used k = 10. This is a visualization of the top terms in each topic, ranked by their respective betas. We considered summarizing each topic by the words listed under it, but we faced difficulty distinguishing some topics because their terms are too similar (perhaps because we picked books on the same topic, so it makes sense that they share many terms).
#relative word frequencies
tidy_keyhar <- keyhar %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
# Count most common words in both
tidy_keyhar %>%
count(word, sort = TRUE)
## # A tibble: 13,031 × 2
## word n
## <chr> <int>
## 1 love 910
## 2 life 793
## 3 children 513
## 4 woman 476
## 5 marriage 471
## 6 wife 419
## 7 women 369
## 8 time 276
## 9 husband 259
## 10 married 244
## # ℹ 13,021 more rows
bookfreq <- tidy_keyhar %>%
mutate(author = ifelse(gutenberg_id==57592, "Key", "Hardy")) %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
count(author, word) %>%
group_by(author) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
spread(author, proportion)
ggplot(bookfreq, aes(x = Key, y = Hardy, color = abs(Key - Hardy))) + #absolute difference in frequency between Key and hardy
geom_abline(color = "black", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "red") +
theme_tufte(base_family = "Helvetica") +
theme(legend.position="none",
strip.background = element_blank(),
strip.text.x = element_blank()) +
labs(x = "Key", y = "Hardy") +
coord_equal()
## Warning: Removed 8776 rows containing missing values (`geom_point()`).
## Warning: Removed 8777 rows containing missing values (`geom_text()`).
2d) This frequency graph compares the words most used in Key’s writing versus Hardy’s. A term like “blood”, which is bright red and on Key’s side of the graph, is used more frequently in Key’s writing, and its color suggests the difference in usage is large. Note: red = large absolute difference in usage between the authors, dark green = small absolute difference, grey = middle ground, e.g. one author uses a word only slightly more often than the other.
#split into chapters
keyhar <- keyhar %>%
filter(!is.na(text))
# Divide into documents, each representing one chapter
keyhar_chapter <- keyhar %>%
mutate(author = ifelse(gutenberg_id==57592, "Key", "Hardy")) %>%
group_by(author) %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, author, chapter)
# Split into words
keyhar_chapter_word <- keyhar_chapter %>%
unnest_tokens(word, text)
# Find document-word count
keyhar_word_counts <- keyhar_chapter_word %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
keyhar_word_counts
## # A tibble: 36,655 × 3
## document word n
## <chr> <chr> <int>
## 1 Key_2 love 175
## 2 Key_8 love 145
## 3 Key_1 love 135
## 4 Hardy_29 8vo 107
## 5 Key_1 life 96
## 6 Key_8 life 96
## 7 Key_2 woman 94
## 8 Hardy_29 cloth 89
## 9 Key_6 children 85
## 10 Hardy_29 6d 83
## # ℹ 36,645 more rows
# Cast into DTM format for LDA analysis
keyhar_chapters_dtm <- keyhar_word_counts %>%
cast_dtm(document, word, n)
tm::inspect(keyhar_chapters_dtm)
## <<DocumentTermMatrix (documents: 38, terms: 12834)>>
## Non-/sparse entries: 36655/451037
## Sparsity : 92%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs children husband life love marriage people time wife woman women
## Hardy_29 16 14 65 20 23 5 13 56 4 10
## Key_1 16 3 96 135 71 13 19 7 11 14
## Key_2 2 4 53 175 30 5 21 6 94 56
## Key_3 35 1 49 78 33 21 21 5 18 7
## Key_4 16 5 32 81 23 10 12 5 21 7
## Key_5 10 2 58 59 22 5 11 3 49 49
## Key_6 85 8 52 17 3 7 26 4 40 39
## Key_7 21 2 47 11 2 1 18 4 73 72
## Key_8 77 7 96 145 63 48 25 10 22 10
## Key_9 62 38 33 22 50 17 25 39 31 23
2e) When splitting by chapters, our sparsity jumps from 36% to 92%, which confirms that many cells in this DTM are zero: most words appear in only a small number of chapters.
#restimate topic model with new dtm k=2,
keyhar_chapters_lda <- LDA(keyhar_chapters_dtm, k = 2, control = list(seed = 1234))
#gamma val estimates
keyhar_chapters_gamma <- tidy(keyhar_chapters_lda, matrix = "gamma")
keyhar_chapters_gamma
## # A tibble: 76 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 Key_2 1 0.996
## 2 Key_8 1 1.00
## 3 Key_1 1 1.00
## 4 Hardy_29 1 0.00000653
## 5 Key_6 1 1.00
## 6 Key_4 1 1.00
## 7 Key_3 1 1.00
## 8 Key_7 1 1.00
## 9 Key_9 1 1.00
## 10 Key_5 1 1.00
## # ℹ 66 more rows
2f) The gamma values returned here show the estimated proportion of words within a given chapter allocated to a given topic (which we take to represent an author). Chapters of Key’s writing, such as Key_8 (chapter 8 of Key’s book), have gamma values of essentially 1 for topic 1, meaning nearly all of their words are allocated to the ‘Key’ topic, while Hardy_29 has a gamma near 0 for that topic.
#unsupervised learning distinguish models
#first separate the document name into title and chapter
keyhar_chapters_gamma <- keyhar_chapters_gamma %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
keyhar_chapter_classifications <- keyhar_chapters_gamma %>%
group_by(title, chapter) %>%
top_n(1, gamma) %>%
ungroup()
keyhar_book_topics <- keyhar_chapter_classifications %>%
count(title, topic) %>%
group_by(title) %>%
top_n(1, n) %>%
ungroup() %>%
transmute(consensus = title, topic)
keyhar_chapter_classifications %>%
inner_join(keyhar_book_topics, by = "topic") %>%
filter(title != consensus)
## # A tibble: 1 × 5
## title chapter topic gamma consensus
## <chr> <int> <int> <dbl> <chr>
## 1 Hardy 15 1 0.599 Key
# Look at document-word pairs to see which words in each document
# were assigned to a given topic
assignments_kh <- augment(keyhar_chapters_lda, data = keyhar_chapters_dtm)
assignments_kh
## # A tibble: 36,655 × 4
## document term count .topic
## <chr> <chr> <dbl> <dbl>
## 1 Key_2 love 175 1
## 2 Key_8 love 145 1
## 3 Key_1 love 135 1
## 4 Hardy_29 love 20 2
## 5 Key_6 love 17 1
## 6 Key_4 love 81 1
## 7 Key_3 love 78 1
## 8 Key_7 love 11 1
## 9 Key_9 love 22 1
## 10 Key_5 love 59 1
## # ℹ 36,645 more rows
assignments_kh <- assignments_kh %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
inner_join(keyhar_book_topics, by = c(".topic" = "topic"))
assignments_kh %>%
count(title, consensus, wt = count) %>%
group_by(title) %>%
mutate(percent = n / sum(n)) %>%
ggplot(aes(consensus, title, fill = percent)) +
geom_tile() +
scale_fill_gradient2(high = "red", label = percent_format()) +
geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +
theme_tufte(base_family = "Helvetica") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid = element_blank()) +
labs(x = "Book words assigned to",
y = "Book words came from",
fill = "% of assignments")
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
2g) After running the unsupervised model, we see that it correctly assigned
92.5% of the words in Hardy's writing and 100% of the words in Key's. This
is promising: if we gave the model more unseen text from Hardy, it should
be able to categorize it as his work. However, the 100% success rate on
Key's writing may indicate that the model is overfit and has merely
memorized the data; if so, it would not reliably categorize an unseen
sample of Key's writing as her own.
Run preText on the new book(s) of your choice.
#reformat the text into a quanteda corpus object
# load in corpus of keyhar text data.
corp <- corpus(keyhar, text_field = "text")
# sample 100 of the first 3000 documents for this example
documents <- corp[sample(1:3000,100)]
# take a look at the document names
print(names(documents[1:10]))
## [1] "text2203" "text1761" "text1805" "text1986" "text615" "text2321"
## [7] "text2445" "text2752" "text586" "text872"
#factorial preprocessing (all combinations of steps, including n-grams)
preprocessed_documents <- factorial_preprocessing(
documents,
use_ngrams = TRUE,
infrequent_term_threshold = 0.2, # drop terms appearing in less than 20% of documents
verbose = FALSE)
#results of preprocessing
preText_results <- preText(
preprocessed_documents,
dataset_name = "Keyhar text",
distance_method = "cosine",
num_comparisons = 20,
verbose = FALSE)
#plot the results
preText_score_plot(preText_results)
##----- Donovan and Hardy (first attempt, issue with accuracy) -----
#get the two books
Donhar <- gutenberg_download(c(53368,35534),
meta_fields = "author")
Donhar
## # A tibble: 12,212 × 3
## gutenberg_id text author
## <int> <chr> <chr>
## 1 35534 "[Transcriber's note: The author's spelling has been mai… Hardy…
## 2 35534 "" Hardy…
## 3 35534 "+ signs around words indicate the use of a different fo… Hardy…
## 4 35534 "" Hardy…
## 5 35534 "In the word \"Puranic\", the \"a\" is overlined in the … Hardy…
## 6 35534 "" Hardy…
## 7 35534 "" Hardy…
## 8 35534 "" Hardy…
## 9 35534 "" Hardy…
## 10 35534 "_HOW TO BE HAPPY THOUGH MARRIED._" Hardy…
## # ℹ 12,202 more rows
#prepare the document
Donhar_words <- Donhar %>%
mutate(author = ifelse(gutenberg_id==53368, "Donovan", "Hardy")) %>%
unnest_tokens(word, text) %>%
filter(!is.na(word)) %>%
count(author, word, sort = TRUE) %>%
ungroup() %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
Donhar_words
## # A tibble: 12,741 × 3
## author word n
## <chr> <chr> <int>
## 1 Hardy wife 335
## 2 Hardy life 270
## 3 Hardy husband 189
## 4 Hardy children 188
## 5 Hardy love 179
## 6 Hardy marriage 171
## 7 Hardy married 171
## 8 Hardy home 136
## 9 Hardy day 127
## 10 Donovan marry 121
## # ℹ 12,731 more rows
#turn it into a document term matrix
Donhar_dtm <- Donhar_words %>%
cast_dtm(author, word, n)
tm::inspect(Donhar_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 10644)>>
## Non-/sparse entries: 12741/8547
## Sparsity : 40%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs children home husband life love marriage married marry wife woman
## Donovan 16 23 13 62 42 38 19 121 27 44
## Hardy 188 136 189 270 179 171 171 40 335 107
#specify how many topics are there
Donhar_lda <- LDA(Donhar_dtm, k = 6, control = list(seed = 1234))
#extract the per-topic-per-word probabilities
Donhar_topics <- tidy(Donhar_lda, matrix = "beta")
head(Donhar_topics, n = 6)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 wife 0.00864
## 2 2 wife 0.0000973
## 3 3 wife 0.0116
## 4 4 wife 0.0160
## 5 5 wife 0.000839
## 6 6 wife 0.0136
##so the term "wife" is most likely to belong to topic 4
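The same comparison can be made for every term at once by keeping each term's highest-beta topic. A minimal sketch, assuming `Donhar_topics` from the `tidy()` call above is still in memory:

```r
# For each term, keep its most probable topic
# (assumes Donhar_topics exists from the chunk above)
Donhar_topics %>%
  group_by(term) %>%
  slice_max(beta, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  head()
```

This generalizes the single-term check on "wife" to the whole vocabulary.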
#plot the results
Donhar_top_terms <- Donhar_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
Donhar_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) +
scale_y_reordered() +
theme_tufte(base_family = "Helvetica")
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
##have a look at whether there are specific words that can distinguish the two books
tidy_Donhar <- Donhar %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
## Count most common words in both
tidy_Donhar %>%
count(word, sort = TRUE)
## # A tibble: 10,644 × 2
## word n
## <chr> <int>
## 1 wife 362
## 2 life 332
## 3 love 221
## 4 marriage 209
## 5 children 204
## 6 husband 202
## 7 married 190
## 8 marry 161
## 9 home 159
## 10 woman 151
## # ℹ 10,634 more rows
bookfreq3 <- tidy_Donhar %>%
mutate(author = ifelse(gutenberg_id==53368, "Donovan", "Hardy")) %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
count(author, word) %>%
group_by(author) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
spread(author, proportion)
head(bookfreq3)
## # A tibble: 6 × 3
## word Donovan Hardy
## <chr> <dbl> <dbl>
## 1 a NA 0.000213
## 2 aback NA 0.0000305
## 3 abandon 0.000141 0.0000305
## 4 abandoned NA 0.0000305
## 5 abandons NA 0.0000305
## 6 abated NA 0.0000305
ggplot(bookfreq3, aes(x = Donovan, y = Hardy, color = abs(Donovan - Hardy))) +
geom_abline(color = "orange", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "purple") +
theme_tufte(base_family = "Helvetica") +
theme(legend.position="none",
strip.background = element_blank(),
strip.text.x = element_blank()) +
labs(x = "Donovan", y = "Hardy") +
coord_equal()
## Warning: Removed 8046 rows containing missing values (`geom_point()`).
## Warning: Removed 8047 rows containing missing values (`geom_text()`).
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
##split into chapter
Donhar <- Donhar %>%
filter(!is.na(text))
Donhar
## # A tibble: 12,212 × 3
## gutenberg_id text author
## <int> <chr> <chr>
## 1 35534 "[Transcriber's note: The author's spelling has been mai… Hardy…
## 2 35534 "" Hardy…
## 3 35534 "+ signs around words indicate the use of a different fo… Hardy…
## 4 35534 "" Hardy…
## 5 35534 "In the word \"Puranic\", the \"a\" is overlined in the … Hardy…
## 6 35534 "" Hardy…
## 7 35534 "" Hardy…
## 8 35534 "" Hardy…
## 9 35534 "" Hardy…
## 10 35534 "_HOW TO BE HAPPY THOUGH MARRIED._" Hardy…
## # ℹ 12,202 more rows
#note: the row count is unchanged; the blank lines are empty strings (""), not NAs, so nothing was removed
# Divide into documents, each representing one chapter
Donhar_chapter <- Donhar %>%
mutate(author = ifelse(gutenberg_id==53368, "Donovan", "Hardy")) %>%
group_by(author) %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, author, chapter)
Donhar_chapter
## # A tibble: 9,864 × 3
## gutenberg_id text document
## <int> <chr> <chr>
## 1 35534 "CHAPTER I." Hardy_1
## 2 35534 "" Hardy_1
## 3 35534 "HOW TO BE HAPPY _THOUGH_ MARRIED." Hardy_1
## 4 35534 "" Hardy_1
## 5 35534 " \"How delicious is the winning" Hardy_1
## 6 35534 " Of a kiss at love's beginning," Hardy_1
## 7 35534 " When two mutual hearts are sighing" Hardy_1
## 8 35534 " For the knot there's no untying!\"--_T. Campbell._" Hardy_1
## 9 35534 "" Hardy_1
## 10 35534 " \"Deceive not thyself by over-expecting happines… Hardy_1
## # ℹ 9,854 more rows
# Split into words
Donhar_chapter_word <- Donhar_chapter %>%
unnest_tokens(word, text)
Donhar_chapter_word
## # A tibble: 86,377 × 3
## gutenberg_id document word
## <int> <chr> <chr>
## 1 35534 Hardy_1 chapter
## 2 35534 Hardy_1 i
## 3 35534 Hardy_1 how
## 4 35534 Hardy_1 to
## 5 35534 Hardy_1 be
## 6 35534 Hardy_1 happy
## 7 35534 Hardy_1 _though_
## 8 35534 Hardy_1 married
## 9 35534 Hardy_1 how
## 10 35534 Hardy_1 delicious
## # ℹ 86,367 more rows
# Find document-word counts
Donhar_word_counts <- Donhar_chapter_word %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
Donhar_word_counts
## # A tibble: 20,995 × 3
## document word n
## <chr> <chr> <int>
## 1 Hardy_29 8vo 107
## 2 Hardy_29 cloth 89
## 3 Hardy_29 6d 83
## 4 Hardy_29 author 76
## 5 Hardy_29 book 65
## 6 Hardy_29 life 65
## 7 Hardy_29 crown 64
## 8 Hardy_29 edition 59
## 9 Hardy_29 wife 56
## 10 Hardy_19 children 42
## # ℹ 20,985 more rows
# Cast into DTM format for LDA analysis
Donhar_chapters_dtm <- Donhar_word_counts %>%
cast_dtm(document, word, n)
tm::inspect(Donhar_chapters_dtm)
## <<DocumentTermMatrix (documents: 29, terms: 9390)>>
## Non-/sparse entries: 20995/251315
## Sparsity : 92%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs children day home husband life love marriage married people wife
## Hardy_10 3 8 10 22 4 14 12 4 2 31
## Hardy_11 0 10 6 17 13 11 6 10 6 21
## Hardy_19 42 4 10 6 6 8 0 0 3 6
## Hardy_2 7 3 5 3 21 4 25 15 3 12
## Hardy_24 1 12 13 22 18 2 2 7 3 21
## Hardy_27 4 2 3 5 4 1 2 1 4 5
## Hardy_29 16 18 12 14 65 20 23 16 5 56
## Hardy_3 2 4 1 3 11 4 13 5 1 37
## Hardy_4 1 3 2 11 6 5 6 3 1 29
## Hardy_6 1 8 1 5 6 8 14 10 11 9
##no Donovan chapters appear in the sample shown, perhaps because the Donovan text is small compared to the Hardy text
# re-estimate the topic model with k = 2 on the chapter-level DTM
Donhar_chapters_lda <- LDA(Donhar_chapters_dtm, k = 2, control = list(seed = 1234))
# get the per-document per-topic probabilities (gamma)
Donhar_chapters_gamma <- tidy(Donhar_chapters_lda, matrix = "gamma")
Donhar_chapters_gamma
## # A tibble: 58 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 Hardy_29 1 1.00
## 2 Hardy_19 1 1.00
## 3 Hardy_14 1 0.0000192
## 4 Hardy_18 1 1.00
## 5 Hardy_3 1 0.0000145
## 6 Hardy_10 1 0.0000134
## 7 Hardy_27 1 0.0000174
## 8 Hardy_4 1 0.769
## 9 Hardy_16 1 0.0000215
## 10 Hardy_2 1 1.00
## # ℹ 48 more rows
# First separate the document name into title and chapter
Donhar_chapters_gamma <- Donhar_chapters_gamma %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
Donhar_chapter_classifications <- Donhar_chapters_gamma %>%
group_by(title, chapter) %>%
top_n(1, gamma) %>%
ungroup()
Donhar_book_topics <- Donhar_chapter_classifications %>%
count(title, topic) %>%
group_by(title) %>%
top_n(1, n) %>%
ungroup() %>%
transmute(consensus = title, topic)
Donhar_chapter_classifications %>%
inner_join(Donhar_book_topics, by = "topic") %>%
filter(title != consensus)
## # A tibble: 0 × 5
## # ℹ 5 variables: title <chr>, chapter <int>, topic <int>, gamma <dbl>,
## # consensus <chr>
# Look at document-word pairs to see which words in each document were assigned
# to a given topic
assignments3 <- augment(Donhar_chapters_lda, data = Donhar_chapters_dtm)
assignments3
## # A tibble: 20,995 × 4
## document term count .topic
## <chr> <chr> <dbl> <dbl>
## 1 Hardy_29 8vo 107 1
## 2 Hardy_29 cloth 89 1
## 3 Hardy_29 6d 83 1
## 4 Hardy_29 author 76 1
## 5 Hardy_14 author 1 2
## 6 Hardy_3 author 1 2
## 7 Hardy_10 author 1 2
## 8 Hardy_4 author 1 1
## 9 Hardy_26 author 1 2
## 10 Hardy_29 book 65 1
## # ℹ 20,985 more rows
assignments3 <- assignments3 %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
inner_join(Donhar_book_topics, by = c(".topic" = "topic"))
assignments3 %>%
count(title, consensus, wt = count) %>%
group_by(title) %>%
mutate(percent = n / sum(n)) %>%
ggplot(aes(consensus, title, fill = percent)) +
geom_tile() +
scale_fill_gradient2(high = "red", label = percent_format()) +
geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +
theme_tufte(base_family = "Helvetica") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid = element_blank()) +
labs(x = "Book words assigned to",
y = "Book words came from",
fill = "% of assignments")
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
##----- Brontës (accuracy issue fixed, but this model performs poorly; possible overtuning) -----
##wondering whether the problem is that the samples are too short, so use the Brontës' works as samples
#get the two books
bronte <- gutenberg_download(c(768,1260),
meta_fields = "author")
#prepare the document
bronte_words <- bronte %>%
mutate(author = ifelse(gutenberg_id==768, "Emily", "Charlotte")) %>%
unnest_tokens(word, text) %>%
filter(!is.na(word)) %>%
count(author, word, sort = TRUE) %>%
ungroup() %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
#turn it into a document term matrix
bronte_dtm <- bronte_words %>%
cast_dtm(author, word, n)
tm::inspect(bronte_dtm)
## <<DocumentTermMatrix (documents: 2, terms: 15328)>>
## Non-/sparse entries: 21265/9391
## Sparsity : 31%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs catherine day don’t heathcliff house jane linton miss sir time
## Charlotte 1 232 155 0 182 341 0 310 316 244
## Emily 336 105 180 421 142 0 346 129 43 128
#specify how many topics are there
bronte_lda <- LDA(bronte_dtm, k = 6, control = list(seed = 1234))
#extract the per-topic-per-word probabilities
bronte_topics <- tidy(bronte_lda, matrix = "beta")
head(bronte_topics, n = 6)
## # A tibble: 6 × 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 heathcliff 1.86e-27
## 2 2 heathcliff 4.89e-20
## 3 3 heathcliff 6.22e- 3
## 4 4 heathcliff 1.16e- 2
## 5 5 heathcliff 5.61e-15
## 6 6 heathcliff 8.71e-30
#it's interesting that the first term again seems most likely to belong to topic 4!
#plot the results
bronte_top_terms <- bronte_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
bronte_top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free", ncol = 3) +
scale_y_reordered() +
theme_tufte(base_family = "Helvetica")
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
##have a look at whether there are specific words that can distinguish the two books
tidy_bronte <- bronte %>%
unnest_tokens(word, text) %>%
anti_join(stop_words)
## Joining with `by = join_by(word)`
## Count most common words in both
tidy_bronte %>%
count(word, sort = TRUE)
## # A tibble: 15,328 × 2
## word n
## <chr> <int>
## 1 miss 439
## 2 heathcliff 421
## 3 time 372
## 4 sir 359
## 5 linton 346
## 6 jane 341
## 7 catherine 337
## 8 day 337
## 9 don’t 335
## 10 house 324
## # ℹ 15,318 more rows
bookfreq2 <- tidy_bronte %>%
mutate(author = ifelse(gutenberg_id==768, "Emily", "Charlotte")) %>%
mutate(word = str_extract(word, "[a-z']+")) %>%
count(author, word) %>%
group_by(author) %>%
mutate(proportion = n / sum(n)) %>%
select(-n) %>%
spread(author, proportion)
head(bookfreq2)
## # A tibble: 6 × 3
## word Charlotte Emily
## <chr> <dbl> <dbl>
## 1 a 0.000170 0.0000473
## 2 abaht NA 0.0000237
## 3 abandon 0.0000463 0.0000237
## 4 abandoned 0.000123 0.0000947
## 5 abandonment 0.0000309 0.0000237
## 6 abashed NA 0.0000237
ggplot(bookfreq2, aes(x = Emily, y = Charlotte, color = abs(Emily - Charlotte))) +
geom_abline(color = "orange", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "purple") +
theme_tufte(base_family = "Helvetica") +
theme(legend.position="none",
strip.background = element_blank(),
strip.text.x = element_blank()) +
labs(x = "Emily", y = "Charlotte") +
coord_equal()
## Warning: Removed 8982 rows containing missing values (`geom_point()`).
## Warning: Removed 8983 rows containing missing values (`geom_text()`).
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
##split into chapter
bronte <- bronte %>%
filter(!is.na(text))
# Divide into documents, each representing one chapter
bronte_chapter <- bronte %>%
mutate(author = ifelse(gutenberg_id==768, "Emily", "Charlotte")) %>%
group_by(author) %>%
mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() %>%
filter(chapter > 0) %>%
unite(document, author, chapter)
# Split into words
bronte_chapter_word <- bronte_chapter %>%
unnest_tokens(word, text)
# Find document-word counts
bronte_word_counts <- bronte_chapter_word %>%
anti_join(stop_words) %>%
count(document, word, sort = TRUE) %>%
ungroup()
## Joining with `by = join_by(word)`
bronte_word_counts
## # A tibble: 73,614 × 3
## document word n
## <chr> <chr> <int>
## 1 Charlotte_27 jane 53
## 2 Charlotte_38 jane 51
## 3 Charlotte_38 sir 40
## 4 Emily_10 heathcliff 37
## 5 Charlotte_24 sir 35
## 6 Charlotte_34 st 35
## 7 Charlotte_18 ingram 34
## 8 Emily_21 linton 34
## 9 Emily_27 catherine 33
## 10 Charlotte_34 john 32
## # ℹ 73,604 more rows
# Cast into DTM format for LDA analysis
bronte_chapters_dtm <- bronte_word_counts %>%
cast_dtm(document, word, n)
tm::inspect(bronte_chapters_dtm)
## <<DocumentTermMatrix (documents: 73, terms: 15241)>>
## Non-/sparse entries: 73614/1038979
## Sparsity : 93%
## Maximal term length: 18
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs catherine day don’t heathcliff house jane linton miss sir time
## Charlotte_17 0 8 9 0 5 0 0 16 7 12
## Charlotte_21 0 17 3 0 7 21 0 17 28 10
## Charlotte_24 0 12 17 0 2 22 0 7 35 11
## Charlotte_27 0 9 10 0 7 53 0 1 23 16
## Charlotte_28 0 10 4 0 14 1 0 0 1 8
## Charlotte_34 0 13 5 0 11 20 0 1 4 10
## Charlotte_38 0 7 6 0 12 51 0 4 40 5
## Emily_10 19 7 10 37 6 0 27 7 3 8
## Emily_17 13 6 10 27 7 0 9 3 1 5
## Emily_21 22 10 12 21 8 0 34 17 0 11
#re-estimate the topic model with this DTM object
bronte_chapters_lda <- LDA(bronte_chapters_dtm, k = 2, control = list(seed = 1234))
#get the gamma value--the estimated proportion of words within a given chapter allocated to a given volume
bronte_chapters_gamma <- tidy(bronte_chapters_lda, matrix = "gamma")
bronte_chapters_gamma
## # A tibble: 146 × 3
## document topic gamma
## <chr> <int> <dbl>
## 1 Charlotte_27 1 0.444
## 2 Charlotte_38 1 0.472
## 3 Emily_10 1 0.500
## 4 Charlotte_24 1 0.530
## 5 Charlotte_34 1 0.418
## 6 Charlotte_18 1 0.495
## 7 Emily_21 1 0.596
## 8 Emily_27 1 0.559
## 9 Charlotte_4 1 0.591
## 10 Charlotte_5 1 0.305
## # ℹ 136 more rows
## the gamma values here are all close to 0.5, suggesting the model cannot cleanly assign chapters to one topic
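One way to see how indecisive these gamma values are is to average them by book and topic. A minimal sketch, assuming `bronte_chapters_gamma` from the `tidy()` call above is still in memory (note it still has the original `document` column at this point):

```r
# Mean gamma per book per topic: values near 0.5 mean weak separation
# (assumes bronte_chapters_gamma exists from the chunk above)
bronte_chapters_gamma %>%
  separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
  group_by(title, topic) %>%
  summarise(mean_gamma = mean(gamma), .groups = "drop")
```

For the Key/Hardy model the comparable means sat near 0 and 1; values near 0.5 here confirm the weaker separation.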
# First separate the document name into title and chapter
bronte_chapters_gamma <- bronte_chapters_gamma %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE)
bronte_chapter_classifications <- bronte_chapters_gamma %>%
group_by(title, chapter) %>%
top_n(1, gamma) %>%
ungroup()
bronte_book_topics <- bronte_chapter_classifications %>%
count(title, topic) %>%
group_by(title) %>%
top_n(1, n) %>%
ungroup() %>%
transmute(consensus = title, topic)
bronte_chapter_classifications %>%
inner_join(bronte_book_topics, by = "topic") %>%
filter(title != consensus)
## # A tibble: 31 × 5
## title chapter topic gamma consensus
## <chr> <int> <int> <dbl> <chr>
## 1 Charlotte 24 1 0.530 Emily
## 2 Charlotte 4 1 0.591 Emily
## 3 Charlotte 20 1 0.509 Emily
## 4 Charlotte 21 1 0.539 Emily
## 5 Charlotte 3 1 0.579 Emily
## 6 Charlotte 11 1 0.558 Emily
## 7 Charlotte 16 1 0.593 Emily
## 8 Charlotte 13 1 0.556 Emily
## 9 Charlotte 29 1 0.698 Emily
## 10 Charlotte 10 1 0.503 Emily
## # ℹ 21 more rows
# Look at document-word pairs to see which words in each document were assigned to a given topic
assignments2 <- augment(bronte_chapters_lda, data = bronte_chapters_dtm)
assignments2
## # A tibble: 73,614 × 4
## document term count .topic
## <chr> <chr> <dbl> <dbl>
## 1 Charlotte_27 jane 53 2
## 2 Charlotte_38 jane 51 2
## 3 Charlotte_24 jane 22 2
## 4 Charlotte_34 jane 20 2
## 5 Charlotte_4 jane 14 1
## 6 Charlotte_5 jane 3 2
## 7 Charlotte_25 jane 21 2
## 8 Charlotte_26 jane 5 2
## 9 Charlotte_20 jane 24 2
## 10 Charlotte_21 jane 21 2
## # ℹ 73,604 more rows
assignments2 <- assignments2 %>%
separate(document, c("title", "chapter"), sep = "_", convert = TRUE) %>%
inner_join(bronte_book_topics, by = c(".topic" = "topic"))
assignments2 %>%
count(title, consensus, wt = count) %>%
group_by(title) %>%
mutate(percent = n / sum(n)) %>%
ggplot(aes(consensus, title, fill = percent)) +
geom_tile() +
scale_fill_gradient2(high = "red", label = percent_format()) +
geom_text(aes(x = consensus, y = title, label = scales::percent(percent))) +
theme_tufte(base_family = "Helvetica") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),
panel.grid = element_blank()) +
labs(x = "Book words assigned to",
y = "Book words came from",
fill = "% of assignments")
## Warning in grid.Call(C_textBounds, as.graphicsAnnot(x$label), x$x, x$y, : font
## family not found in Windows font database
##reformat the data into a "quanteda" corpus object
# load in corpus of Bronte text data.
corp2 <- corpus(bronte, text_field = "text")
# sample 100 of the first 3000 documents for this example (note: corp2, not corp)
documents2 <- corp2[sample(1:3000,100)]
# take a look at the document names
print(names(documents2[1:10]))
## [1] "text2204" "text1118" "text2610" "text751" "text2732" "text1426"
## [7] "text1675" "text2698" "text1091" "text1322"
#preprocessing the data in 128 different ways
preprocessed_documents2 <- factorial_preprocessing(
documents2,
use_ngrams = TRUE,
infrequent_term_threshold = 0.2,
verbose = FALSE)
## Preprocessing 100 documents 128 different ways...
#compare the distance between documents that have been processed in different ways
preText_results2 <- preText(
preprocessed_documents2,
dataset_name = "Bronte text",
distance_method = "cosine",
num_comparisons = 20,
verbose = FALSE)
## Generating document distances...
## Warning in stats::cmdscale(distances, k = dimensions): only 0 of the first 2
## eigenvalues are > 0
## Generating preText Scores...
## Generating regression results..
## The R^2 for this model is: 0.4171211
## Regression results (negative coefficients imply less risk):
## Variable Coefficient SE
## 1 Intercept 0.088 0.008
## 2 Remove Punctuation -0.020 0.006
## 3 Remove Numbers 0.001 0.006
## 4 Lowercase 0.000 0.006
## 5 Stemming 0.001 0.006
## 6 Remove Stopwords -0.008 0.006
## 7 Remove Infrequent Terms 0.048 0.006
## 8 Use NGrams -0.008 0.006
## Complete in: 15.25 seconds...
#plot accordingly
preText_score_plot(preText_results2)
## Warning in ggplot2::geom_point(ggplot2::aes(x = Variable, y = Coefficient), :
## Ignoring unknown parameters: `linewidth`
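The regression table printed above can also be visualized directly. The preText package provides a helper for this; a minimal sketch, assuming `preText_results2` is still in memory:

```r
# Plot the preprocessing-choice regression coefficients
# (assumes preText_results2 from the preText() call above)
preText::regression_coefficient_plot(preText_results2,
                                     remove_intercept = TRUE)
```

Negative coefficients (e.g. removing punctuation here) imply the step carries less risk of changing substantive conclusions; the large positive coefficient for removing infrequent terms flags that step as the riskiest choice for this corpus.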